Simulating

September 23 + 25, 2024

Jo Hardin

Agenda 9/23/24

  1. Why simulate?
  2. What makes a good simulation?
  3. Examples

Goals of Simulating Complicated Models

The goal of simulating a complicated model is not only to create a program which will provide the desired results. We also hope to be able to write code such that:

  1. The problem is broken down into small pieces.
  2. The problem has checks in it to see what works (run the lines inside the functions!).
  3. uses simple code (as often as possible).

Aside: a few R functions (ifelse())

data.frame(value = c(-2:2)) %>%
  mutate(abs_value = ifelse(value >=0, value, -value))  # abs val
  value abs_value
1    -2         2
2    -1         1
3     0         0
4     1         1
5     2         2

Aside: a few R functions (case_when())

set.seed(4747)
diamonds %>% select(carat, cut, color, price) %>%
  sample_n(20) %>%
  mutate(price_cat = case_when(
    price > 10000 ~ "expensive",
    price > 1500 ~ "medium", 
    TRUE ~ "inexpensive"))
# A tibble: 20 × 5
   carat cut       color price price_cat  
   <dbl> <ord>     <ord> <int> <chr>      
 1  1.23 Very Good F     10276 expensive  
 2  0.35 Premium   H       706 inexpensive
 3  0.7  Good      E      2782 medium     
 4  0.4  Ideal     D      1637 medium     
 5  0.53 Ideal     G      1255 inexpensive
 6  2.22 Ideal     G     14637 expensive  
 7  0.3  Ideal     G       878 inexpensive
 8  1.05 Ideal     H      4223 medium     
 9  0.53 Premium   E      1654 medium     
10  1.7  Ideal     H      7068 medium     
11  0.31 Good      E       698 inexpensive
12  0.31 Ideal     F       840 inexpensive
13  1.03 Ideal     H      4900 medium     
14  0.31 Premium   G       698 inexpensive
15  1.56 Premium   G      8858 medium     
16  1.71 Premium   G     11032 expensive  
17  1    Good      E      5345 medium     
18  1.86 Ideal     J     10312 expensive  
19  1.08 Very Good E      3726 medium     
20  0.31 Premium   E       698 inexpensive

Aside: a few R functions (sample())

sampling, shuffling, and resampling: sample()

set.seed(47)
alph <- letters[1:10]
alph
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
sample(alph, 5, replace = FALSE) # sample (from a population)
[1] "i" "b" "g" "d" "a"
sample(alph, 5, replace = TRUE) # sample (from a population)
[1] "j" "g" "f" "i" "f"
sample(alph, 10, replace = FALSE)  # shuffle
 [1] "f" "h" "i" "e" "g" "d" "c" "j" "b" "a"
sample(alph, 10, replace = TRUE)  # resample
 [1] "e" "j" "e" "b" "e" "c" "f" "a" "e" "a"

Why?

Three simulating methods are used for different purposes:

  1. Monte Carlo methods - use repeated sampling from a population with known characteristics.

  2. Randomization / Permutation methods - use shuffling (sampling without replacement from a sample) to test hypotheses of “no effect”.

  3. Bootstrap methods - use resampling (sampling with replacement from a sample) to establish confidence intervals.

Aside: a few R functions (set.seed())

What if we want to be able to generate the same random numbers (here on the interval from 0 to 1) over and over?

set.seed(4747)
runif(4, 0, 1) # random uniform numbers
[1] 0.1949071 0.3390270 0.5147919 0.4516470
set.seed(123)
runif(4, 0, 1) # random uniform numbers
[1] 0.2875775 0.7883051 0.4089769 0.8830174
set.seed(4747)
runif(4, 0, 1) # random uniform numbers
[1] 0.1949071 0.3390270 0.5147919 0.4516470

Small simulation problem

  • Simulation as easier than calculus for estimating probabilities.
  • Simulation to understand complicated models.
  • Simulation to consider methods with fewer assumptions.

Sally & Joan

Consider a situation where Sally and Joan plan to meet to study in their college campus center (Mosteller 1987; Baumer, Kaplan, and Horton 2021). They are both impatient people who will wait only 10 minutes for the other before leaving.

But their planning was incomplete. Sally said, “Meet me between 7 and 8 tonight at the student center.” When should Joan plan to arrive at the campus center? And what is the probability that they actually meet?

Simulate their meeting

Assume that Sally and Joan are both equally likely to arrive at the campus center anywhere between 7pm and 8pm.

library(tictoc)
n <- 10000

tic()
meet_vect <- tibble(
  sally = runif(n, 0, max = 60),
  joan = runif(n, min = 0, max = 60),
  result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )
)
toc()
0.002 sec elapsed
meet_func <- function(nada){
  tibble(
    sally = runif(1, min = 0, max = 60),
    joan = runif(1, min = 0, max = 60),
    result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )
  )
}
tic()
meet_map <- 1:n |> 
  map(meet_func) |> 
  list_rbind()
toc()
2.42 sec elapsed
tic()
meet_for <- data.frame()
for(i in 1:n){
  meet_for <- meet_for |> rbind(
    tibble(
      sally = runif(1, min = 0, max = 60),
      joan = runif(1, min = 0, max = 60),
      result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )))
}
toc()
3.978 sec elapsed

Results

The results themselves are equivalent. Differing values due to randomness in the simulation.

meet_vect |> 
  group_by(result) |> 
  summarize(number = n()) |> 
  mutate(proprotion = number / sum(number))
# A tibble: 2 × 3
  result      number proprotion
  <chr>        <int>      <dbl>
1 They do not   6966      0.697
2 They meet     3034      0.303
meet_map |> 
  group_by(result) |> 
  summarize(number = n())|> 
  mutate(proprotion = number / sum(number))
# A tibble: 2 × 3
  result      number proprotion
  <chr>        <int>      <dbl>
1 They do not   7087      0.709
2 They meet     2913      0.291
meet_for |> 
  group_by(result) |> 
  summarize(number = n())|> 
  mutate(proprotion = number / sum(number))
# A tibble: 2 × 3
  result      number proprotion
  <chr>        <int>      <dbl>
1 They do not   6942      0.694
2 They meet     3058      0.306

Visualizing the meet up

meet_map |> 
  ggplot(aes(x = joan, y = sally, color = result)) +
  geom_point(alpha = 0.3) + 
  geom_abline(intercept = 10, slope = 1) + 
  geom_abline(intercept = -10, slope = 1) + 
  scale_color_brewer(palette = "Paired")

Simulation best practices

  • no magic numbers
  • comment your code
  • use informative names
  • set a seed for reproducibility

Bigger simulation

References

Baumer, Ben, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science with r. CRC Press. https://mdsr-book.github.io/mdsr2e/.
Mosteller, Frederick. 1987. Fifty Challenging Problems in Probability with Solutions. Dover Publications: Mineola, NY.